Abstract

The path-difference metric is one of the oldest and most popular distances for the comparison of phylogenetic
trees, but its statistical properties are still largely unknown. In this paper we compute the expected value
under the Yule model of evolution of its square on the space of fully resolved rooted phylogenetic trees with
$n$ leaves. This complements previous work by Steel-Penny and Mir-Rosselló, who computed this mean value
for fully resolved unrooted and rooted phylogenetic trees, respectively, under the uniform distribution.
1. Introduction

The definition and study of metrics for the comparison of rooted phylogenetic trees on the same set of
taxa is a classical problem in phylogenetics [6, Ch. 30]. A classical and popular family of such metrics is
based on the comparison, by different methods, of the vectors of lengths of the (undirected) paths connecting
all pairs of taxa in the corresponding trees [?, ?, 11, 20]. These metrics are generically called nodal distances,
although some of them also have specific names. For instance, the metric defined through the Euclidean
distance between path-length vectors is called the path-difference metric [?], or cladistic difference [4].

In contrast with other metrics, the statistical properties of these nodal distances are mostly unknown.
Actually, the only statistical property that has been established so far for any one of them is the expected,
or mean, value of the square of the path-difference metric for unrooted [18] and rooted [11] fully resolved
phylogenetic trees under the uniform distribution (that is, when all phylogenetic trees with the same number
of taxa are equiprobable). The knowledge of the expected value of a metric is useful, because it provides an
indication about the significance of the similarity of two individuals measured through this metric [18].

But phylogeneticists also consider other probability distributions on the space of phylogenetic trees on
a fixed set of taxa, defined through stochastic models of evolution [6, Ch. 33]. The most popular such model
is Yule's [7, 21], defined by an evolutionary process where, at each step, each currently extant species can
give rise, with the same probability, to two new species. Under this model, different phylogenetic trees with
the same number of leaves may have different probabilities. Formal details on this model are given in the
next section.

In this paper we compute the expected value of the square of the path-difference metric for rooted fully
resolved phylogenetic trees under this Yule model. Besides the aforementioned application of this value in
the assessment of tree comparisons, the knowledge of formulas for this expected value under different models
may allow the use of the path-difference metric to test stochastic models of tree growth, a popular line of
research in recent years which so far has been mostly based on shape indices [13].
2. Preliminaries

In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a fully resolved, or binary, rooted tree
with its leaves bijectively labeled in $S$. We understand such a rooted tree as a directed graph, with its arcs
pointing away from the root. To simplify the language, we shall always identify a leaf of a phylogenetic tree
with its label. We shall also use the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on
the set $\{1,\ldots,n\}$. We shall denote by $\mathcal{T}(S)$ the space of all phylogenetic trees on $S$ and by $\mathcal{T}_n$ the space of
all phylogenetic trees with $n$ leaves.

Whenever there exists a directed path from $u$ to $v$ in a phylogenetic tree $T$, we shall say that $v$ is a
descendant of $u$. The distance $d_T(u,v)$ between two nodes $u,v$ in a phylogenetic tree $T$ is the length (in
number of arcs) of the unique undirected path connecting $u$ and $v$. The depth $\delta_T(v)$ of a node $v$ in $T$ is the
distance from the root $r$ of $T$ to $v$. The path-difference distance [?, 5] between a pair of trees $T,T'\in\mathcal{T}_n$ is

$$d_\nu(T,T')=\sqrt{\sum_{1\le i<j\le n}\big(d_T(i,j)-d_{T'}(i,j)\big)^2}.$$

The Yule, or Equal-Rate Markov, model of evolution [7, 21] is a stochastic model of phylogenetic tree
growth. It starts with a node, and at every step a leaf is chosen randomly and uniformly and it is split
into two leaves. Finally, the labels are assigned randomly and uniformly to the leaves once the desired
number of leaves is reached. Under this model, if $T$ is a phylogenetic tree with $n$ leaves and set of internal
nodes $V_{int}(T)$, and if for every internal node $v$ we denote by $\ell_T(v)$ the number of its descendant leaves, then
the probability of $T$ is [1, 13]

$$P_Y(T)=\frac{2^{n-1}}{n!}\prod_{v\in V_{int}(T)}\frac{1}{\ell_T(v)-1}.$$

For every $n\ge 1$, let $H_n=\sum_{i=1}^n 1/i$ and $H_n^{(2)}=\sum_{i=1}^n 1/i^2$. Let, moreover, $H_0=H_0^{(2)}=0$. $H_n$ is called
the $n$-th harmonic number, and $H_n^{(2)}$ the $n$-th generalized harmonic number of power 2.
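The definitions above can be made concrete with a small sketch. It assumes, purely for illustration, that a tree is encoded as nested 2-tuples with integer leaves; the function names `harmonic`, `leaves` and `yule_probability` are hypothetical and are not taken from the paper or its supplementary scripts.

```python
from fractions import Fraction
from math import factorial

def harmonic(n, power=1):
    """H_n (power=1) or H_n^{(2)} (power=2); the empty sum gives H_0 = 0."""
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def leaves(tree):
    """Leaf labels of a tree encoded as nested 2-tuples (assumed encoding)."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def yule_probability(tree):
    """2^(n-1)/n! times the product over internal nodes v of 1/(l_T(v) - 1)."""
    def inverse_product(t):
        if not isinstance(t, tuple):
            return Fraction(1)
        here = Fraction(1, len(leaves(t)) - 1)
        return here * inverse_product(t[0]) * inverse_product(t[1])
    n = len(leaves(tree))
    return Fraction(2 ** (n - 1), factorial(n)) * inverse_product(tree)

caterpillar = (((1, 2), 3), 4)
balanced = ((1, 2), (3, 4))
print(yule_probability(caterpillar))  # 1/18
print(yule_probability(balanced))     # 1/9
```

With $n=4$ there are 12 labeled caterpillars and 3 labeled balanced trees, so the probabilities add up to $12/18+3/9=1$, as they should.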

3. Main results 

Let $N_n^2$ be the random variable that chooses independently a pair of trees $T,T'\in\mathcal{T}_n$ and computes

$$d_\nu(T,T')^2=\sum_{1\le i<j\le n}\big(d_T(i,j)-d_{T'}(i,j)\big)^2.$$

In this section we establish the following result.

Theorem 1. The expected value of $N_n^2$ under the Yule model is

$$E_Y(N_n^2)=\frac{2n}{n-1}\Big(2(n^2+24n+7)H_n+13n^2-46n+1-16(n+1)H_n^2-8(n^2-1)H_n^{(2)}\Big).$$
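As a quick illustration, the closed formula of Theorem 1 can be evaluated exactly with rational arithmetic. This is only a sketch; the function names are hypothetical. For $n=2$ there is a single tree, so the value must be 0, and for $n=3$ the three trees are equiprobable and any two different ones are at squared distance 2, which gives $\frac{2}{3}\cdot 2=\frac{4}{3}$.

```python
from fractions import Fraction

def harmonic(n, power=1):
    """H_n for power=1 and H_n^{(2)} for power=2."""
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def expected_sq_path_difference_yule(n):
    """Closed formula of Theorem 1 for E_Y(N_n^2), valid for n >= 2."""
    h, h2 = harmonic(n), harmonic(n, 2)
    bracket = (2 * (n * n + 24 * n + 7) * h + 13 * n * n - 46 * n + 1
               - 16 * (n + 1) * h * h - 8 * (n * n - 1) * h2)
    return Fraction(2 * n, n - 1) * bracket

for n in range(2, 8):
    print(n, expected_sq_path_difference_yule(n))
```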

To prove this formula, we shall use the following auxiliary random variables:

• $D_n$ is the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes $D(T)=\sum_{1\le i<j\le n}d_T(i,j)$.

• $D_n^{(2)}$ is the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes $D^{(2)}(T)=\sum_{1\le i<j\le n}d_T(i,j)^2$.

The connection between $E_Y(N_n^2)$ and the expected values under the Yule model of $D_n$, $D_n^{(2)}$ is given by
the following result.

Proposition 2. $E_Y(N_n^2)=2\Big(E_Y(D_n^{(2)})-E_Y(D_n)^2\Big/\binom{n}{2}\Big)$.




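Proof. Let us develop $E_Y(N_n^2)$ from its raw definition:

$$\begin{aligned}
E_Y(N_n^2) &= \sum_{T,T'} d_\nu(T,T')^2 P_Y(T)P_Y(T') = \sum_{T,T'}\Big(\sum_{1\le i<j\le n}\big(d_T(i,j)-d_{T'}(i,j)\big)^2\Big)P_Y(T)P_Y(T')\\
&= \sum_{1\le i<j\le n}\Big(\sum_{T,T'}d_T(i,j)^2P_Y(T)P_Y(T')+\sum_{T,T'}d_{T'}(i,j)^2P_Y(T)P_Y(T')
-2\sum_{T,T'}d_T(i,j)d_{T'}(i,j)P_Y(T)P_Y(T')\Big)\\
&= \sum_{1\le i<j\le n}\Big(\sum_{T}d_T(i,j)^2P_Y(T)+\sum_{T'}d_{T'}(i,j)^2P_Y(T')
-2\Big(\sum_{T}d_T(i,j)P_Y(T)\Big)\Big(\sum_{T'}d_{T'}(i,j)P_Y(T')\Big)\Big)\\
&= 2\sum_{T}\Big(\sum_{1\le i<j\le n}d_T(i,j)^2\Big)P_Y(T)-2\sum_{1\le i<j\le n}\Big(\sum_{T}d_T(i,j)P_Y(T)\Big)^2\\
&= 2E_Y(D_n^{(2)})-2\binom{n}{2}\Big(\sum_{T}d_T(1,2)P_Y(T)\Big)^2,
\end{aligned}$$

where the last equality holds because, the labels being assigned uniformly at random to the leaves, the sum $\sum_T d_T(i,j)P_Y(T)$ does not depend on the pair $i<j$. And now

$$E_Y(D_n)=\sum_{T}\sum_{1\le i<j\le n}d_T(i,j)P_Y(T)=\sum_{1\le i<j\le n}\sum_{T}d_T(i,j)P_Y(T)=\binom{n}{2}\sum_{T}d_T(1,2)P_Y(T),$$

from where we deduce that $\Big(\sum_T d_T(1,2)P_Y(T)\Big)^2=E_Y(D_n)^2\Big/\binom{n}{2}^2$, and the formula in the statement follows. □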
Now, it is known that the expected value under the Yule model of $D_n$ is

$$E_Y(D_n)=2n(n+1)H_n-4n^2.$$

As far as $E_Y(D_n^{(2)})$ goes, its value is given by the following result. We postpone the proof until the appendix
at the end of the paper.

Theorem 3. $E_Y(D_n^{(2)})=8n(n+1)(H_n^2-H_n^{(2)})-2n(15n+7)H_n+45n^2-n$.

Then, replacing in the expression for $E_Y(N_n^2)$ given in Proposition 2 $E_Y(D_n)$ and $E_Y(D_n^{(2)})$ by their
values, we obtain the formula for $E_Y(N_n^2)$ given in Theorem 1.
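The substitution just described is a purely algebraic step, so it can be sanity-checked numerically. The following sketch (function names are hypothetical) evaluates the right-hand side of Proposition 2 with the stated values of $E_Y(D_n)$ and $E_Y(D_n^{(2)})$ and compares it, in exact arithmetic, with the formula of Theorem 1.

```python
from fractions import Fraction
from math import comb

def harmonic(n, power=1):
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def expected_D(n):
    """E_Y(D_n) = 2n(n+1)H_n - 4n^2."""
    return 2 * n * (n + 1) * harmonic(n) - 4 * n * n

def expected_D2(n):
    """E_Y(D_n^{(2)}) as stated in Theorem 3."""
    h, h2 = harmonic(n), harmonic(n, 2)
    return (8 * n * (n + 1) * (h * h - h2)
            - 2 * n * (15 * n + 7) * h + 45 * n * n - n)

def via_proposition_2(n):
    """2 (E_Y(D_n^{(2)}) - E_Y(D_n)^2 / binom(n, 2))."""
    return 2 * (expected_D2(n) - expected_D(n) ** 2 / comb(n, 2))

def theorem_1(n):
    h, h2 = harmonic(n), harmonic(n, 2)
    bracket = (2 * (n * n + 24 * n + 7) * h + 13 * n * n - 46 * n + 1
               - 16 * (n + 1) * h * h - 8 * (n * n - 1) * h2)
    return Fraction(2 * n, n - 1) * bracket

mismatches = [n for n in range(2, 41) if via_proposition_2(n) != theorem_1(n)]
print(mismatches)  # an empty list means both expressions agree on this range
```

Agreement on a finite range is of course only a consistency check, not a proof.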

4. Conclusions 

In this paper we have computed the expected value $E_Y(N_n^2)$ of the square of the path-difference metric
for rooted fully resolved phylogenetic trees under the Yule model:

$$E_Y(N_n^2)=\frac{2n}{n-1}\Big(2(n^2+24n+7)H_n+13n^2-46n+1-16(n+1)H_n^2-8(n^2-1)H_n^{(2)}\Big).$$

This complements the computation of this expected value under the uniform distribution carried out in [11],
which turned out to be

$$E_U(N_n^2)=2\binom{n}{2}\left(4(n-1)+2-\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}-\left(\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}\right)^2\right).$$



The proof of the formula for $E_Y(N_n^2)$ consists of several long algebraic manipulations of sums of sequences.
Since it is easy to make mistakes in such long algebraic computations, to double-check our result
we have directly computed the value of $E_Y(N_n^2)$ for $n=3,\ldots,7$ and confirmed that our formula gives
the right figures. The Python scripts used to compute them and the results obtained are available in the
Supplementary Material web page http://bioinfo.uib.es/~recerca/phylotrees/nodaldistYule/.
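A direct computation of this kind can be sketched as follows. This is an independent illustration, not the authors' scripts: it assumes a nested-tuple encoding of trees, enumerates all phylogenetic trees with $n$ leaves, weights each one by its Yule probability, and sums over ordered pairs of trees. All function names are hypothetical, and the loop is limited to small $n$ because the number of trees grows very quickly.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def all_trees(labels):
    """Yield every rooted binary tree on `labels` as nested 2-tuples."""
    labels = tuple(labels)
    if len(labels) == 1:
        yield labels[0]
        return
    first, rest = labels[0], labels[1:]
    # Fixing the side that contains `first` counts each root split once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            others = tuple(x for x in rest if x not in extra)
            for left in all_trees((first,) + extra):
                for right in all_trees(others):
                    yield (left, right)

def depths(tree, d=0):
    """Map each leaf of `tree` to its depth, counted from offset `d`."""
    if not isinstance(tree, tuple):
        return {tree: d}
    out = depths(tree[0], d + 1)
    out.update(depths(tree[1], d + 1))
    return out

def path_lengths(tree):
    """Map each leaf pair (i, j) with i < j to d_T(i, j)."""
    if not isinstance(tree, tuple):
        return {}
    left, right = tree
    out = path_lengths(left)
    out.update(path_lengths(right))
    for i, di in depths(left, 1).items():
        for j, dj in depths(right, 1).items():
            out[(min(i, j), max(i, j))] = di + dj
    return out

def yule_probability(tree):
    """2^(n-1)/n! times the product of 1/(l_T(v) - 1) over internal nodes."""
    def size_and_weight(t):
        if not isinstance(t, tuple):
            return 1, Fraction(1)
        (nl, wl), (nr, wr) = size_and_weight(t[0]), size_and_weight(t[1])
        return nl + nr, wl * wr / (nl + nr - 1)
    n, weight = size_and_weight(tree)
    return Fraction(2 ** (n - 1), factorial(n)) * weight

def brute_force(n):
    """E_Y(N_n^2) obtained by summing over all ordered pairs of trees."""
    data = [(path_lengths(t), yule_probability(t))
            for t in all_trees(range(1, n + 1))]
    total = Fraction(0)
    for d1, p1 in data:
        for d2, p2 in data:
            total += p1 * p2 * sum((d1[q] - d2[q]) ** 2 for q in d1)
    return total

def harmonic(n, power=1):
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def closed_form(n):
    """Formula of Theorem 1."""
    h, h2 = harmonic(n), harmonic(n, 2)
    bracket = (2 * (n * n + 24 * n + 7) * h + 13 * n * n - 46 * n + 1
               - 16 * (n + 1) * h * h - 8 * (n * n - 1) * h2)
    return Fraction(2 * n, n - 1) * bracket

for n in range(3, 6):
    print(n, brute_force(n), closed_form(n))
```

The double loop over pairs of trees is quadratic in the number of trees; using Proposition 2 instead would need only a single pass, at the cost of no longer testing that proposition.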

The formulas for $E_Y(N_n^2)$ and $E_U(N_n^2)$ grow at different orders: $E_Y(N_n^2)$ is in $O(n^2\ln(n))$, while $E_U(N_n^2)$
is in $O(n^3)$. Therefore, they can be used to test the Yule and the uniform models as null stochastic models
of evolution for collections of phylogenetic trees reconstructed by different methods. This kind of analysis
has only been performed so far through shape indices of single trees, not by means of the comparison of
pairs of trees. We shall report on it elsewhere.
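The different growth orders can be observed numerically. The sketch below (floating-point arithmetic, hypothetical function names) evaluates both closed formulas and normalizes them by $n^2\ln(n)$ and $n^3$, respectively. The leading terms of the two formulas suggest that the first ratio approaches 4 and the second approaches $4-\pi$, although convergence is slow in both cases.

```python
from math import comb, log, pi

def harmonic(n, power=1):
    return sum(1.0 / i ** power for i in range(1, n + 1))

def yule_mean(n):
    """E_Y(N_n^2), formula of Theorem 1, in floating point."""
    h, h2 = harmonic(n), harmonic(n, 2)
    bracket = (2 * (n * n + 24 * n + 7) * h + 13 * n * n - 46 * n + 1
               - 16 * (n + 1) * h * h - 8 * (n * n - 1) * h2)
    return 2 * n / (n - 1) * bracket

def uniform_mean(n):
    """E_U(N_n^2), formula quoted from [11], in floating point."""
    r = 4 ** (n - 1) / comb(2 * (n - 1), n - 1)
    return 2 * comb(n, 2) * (4 * (n - 1) + 2 - r - r * r)

for n in (10, 100, 1000):
    print(n, yule_mean(n) / (n * n * log(n)), uniform_mean(n) / n ** 3)
print("4 - pi =", 4 - pi)
```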
Appendix

In this appendix we prove Theorem 3, as well as some preliminary lemmas. To begin with, the
following identities on harmonic numbers will be used systematically in the proofs below, usually without any
further notice.

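Lemma. For every $n\ge 2$:

(1) $\sum_{k=1}^{n-1}H_k=n(H_n-1)$

(2) $\sum_{k=1}^{n-1}kH_k=\frac{1}{4}n(n-1)(2H_n-1)$

(3) $\sum_{k=1}^{n-1}\frac{H_k}{k+1}=\frac{1}{2}\big(H_n^2-H_n^{(2)}\big)$

(4) $\sum_{k=1}^{n-1}kH_kH_{n-k}=\binom{n+1}{2}\big(H_{n+1}^2-H_{n+1}^{(2)}-2H_{n+1}+2\big)$

(5) $\sum_{k=1}^{n-1}\big(H_k^2-H_k^{(2)}\big)=n\big(H_n^2-H_n^{(2)}\big)-2n(H_n-1)$

(6) $\sum_{k=1}^{n-1}k\big(H_k^2-H_k^{(2)}\big)=\binom{n}{2}\big(H_n^2-H_n^{(2)}\big)-\frac{1}{4}n(n-1)(2H_n-1)$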

Proof. Identities (1)-(3) are well known and easily proved by induction on $n$: see, for instance, [?, §6.3,
6.4] and [10, §1.2.7]. Identity (4) is proved in [19, Thm. 2]. We shall prove (5) and (6) using the technique
introduced in [3]. The main ingredient is Abel's lemma on summation by parts: for every two sequences
$(a_k)_k$ and $(b_k)_k$,

$$\sum_{k=1}^{n-1}(a_{k+1}-a_k)b_k=-\sum_{k=1}^{n-1}(b_{k+1}-b_k)a_{k+1}+a_nb_n-a_1b_1.$$

To prove (5), take $a_k=k$ and $b_k=H_k^2-H_k^{(2)}$, so that $a_{k+1}-a_k=1$ and $b_{k+1}-b_k=2H_k/(k+1)$. Then,
by Abel's lemma,

$$\sum_{k=1}^{n-1}\big(H_k^2-H_k^{(2)}\big)=-\sum_{k=1}^{n-1}(k+1)\frac{2H_k}{k+1}+n\big(H_n^2-H_n^{(2)}\big)
=n\big(H_n^2-H_n^{(2)}\big)-2\sum_{k=1}^{n-1}H_k=n\big(H_n^2-H_n^{(2)}\big)-2n(H_n-1).$$

To prove (6), take $a_k=\binom{k}{2}$, so that $a_{k+1}-a_k=k$, and $b_k=H_k^2-H_k^{(2)}$. Then, again by Abel's lemma,

$$\sum_{k=1}^{n-1}k\big(H_k^2-H_k^{(2)}\big)=-\sum_{k=1}^{n-1}\binom{k+1}{2}\frac{2H_k}{k+1}+\binom{n}{2}\big(H_n^2-H_n^{(2)}\big)
=\binom{n}{2}\big(H_n^2-H_n^{(2)}\big)-\sum_{k=1}^{n-1}kH_k=\binom{n}{2}\big(H_n^2-H_n^{(2)}\big)-\frac{1}{4}n(n-1)(2H_n-1).$$

□
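The six identities on harmonic numbers are easy to test in exact arithmetic. The sketch below reads identity (4) as a sum of the terms $kH_kH_{n-k}$ with the factor $\binom{n+1}{2}$ on the right-hand side, and identity (6) with the correction term $\frac{1}{4}n(n-1)(2H_n-1)$; these readings are reconstructions, and the code checks that they are at least self-consistent. The helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def H(n, power=1):
    """Harmonic number H_n (power=1) or H_n^{(2)} (power=2)."""
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def b(k):
    """Shorthand for H_k^2 - H_k^{(2)}."""
    return H(k) ** 2 - H(k, 2)

def identities_hold(n):
    """Return six booleans, one per identity (1)-(6), for a given n >= 2."""
    ks = range(1, n)
    quarter = Fraction(n * (n - 1), 4) * (2 * H(n) - 1)
    return [
        sum(H(k) for k in ks) == n * (H(n) - 1),
        sum(k * H(k) for k in ks) == quarter,
        sum(H(k) / (k + 1) for k in ks) == b(n) / 2,
        sum(k * H(k) * H(n - k) for k in ks)
            == comb(n + 1, 2) * (b(n + 1) - 2 * H(n + 1) + 2),
        sum(b(k) for k in ks) == n * b(n) - 2 * n * (H(n) - 1),
        sum(k * b(k) for k in ks) == comb(n, 2) * b(n) - quarter,
    ]

for n in range(2, 8):
    print(n, identities_hold(n))
```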

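Let us consider now the following two random variables:

• $S_n$, that chooses a tree $T\in\mathcal{T}_n$ and computes its Sackin index [15] $S(T)=\sum_{i=1}^n\delta_T(i)$.

• $S_n^{(2)}$, that chooses a tree $T\in\mathcal{T}_n$ and computes $S^{(2)}(T)=\sum_{i=1}^n\delta_T(i)^2$.

It is known that the expected value under the Yule model of $S_n$ is

$$E_Y(S_n)=2n(H_n-1).$$

We shall compute now the expected values under this model of $S_n^{(2)}$ and $D_n^{(2)}$: the first will be used in the
computation of the second. To do this, we shall use the following recursive expressions for $S^{(2)}(T\,\widehat{\ }\,T')$ and
$D^{(2)}(T\,\widehat{\ }\,T')$, where $T\,\widehat{\ }\,T'$ denotes the phylogenetic tree obtained by connecting the roots of $T$ and $T'$ to a new common root.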



Lemma. Let $T,T'$ be two phylogenetic trees on disjoint sets of taxa $S,S'$, with $|S|=k$ and $|S'|=n-k$.
Then:

(1) $S^{(2)}(T\,\widehat{\ }\,T')=S^{(2)}(T)+S^{(2)}(T')+2\big(S(T)+S(T')\big)+n$

(2) $D^{(2)}(T\,\widehat{\ }\,T')=D^{(2)}(T)+D^{(2)}(T')+(n-k)\big(S^{(2)}(T)+4S(T)\big)+k\big(S^{(2)}(T')+4S(T')\big)+2S(T)S(T')+4k(n-k)$

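Proof. Let us assume, without any loss of generality, that $S=\{1,\ldots,k\}$ and $S'=\{k+1,\ldots,n\}$. Then,
as far as (1) goes, we have that

$$\delta_{T\,\widehat{\ }\,T'}(i)^2=\begin{cases}(\delta_T(i)+1)^2 & \text{if } 1\le i\le k\\ (\delta_{T'}(i)+1)^2 & \text{if } k+1\le i\le n\end{cases}$$

and therefore

$$S^{(2)}(T\,\widehat{\ }\,T')=\sum_{i=1}^{k}\big(\delta_T(i)^2+2\delta_T(i)+1\big)+\sum_{i=k+1}^{n}\big(\delta_{T'}(i)^2+2\delta_{T'}(i)+1\big)
=S^{(2)}(T)+2S(T)+S^{(2)}(T')+2S(T')+n.$$

As far as (2) goes, we have that

$$d_{T\,\widehat{\ }\,T'}(i,j)^2=\begin{cases}d_T(i,j)^2 & \text{if } 1\le i<j\le k\\ d_{T'}(i,j)^2 & \text{if } k+1\le i<j\le n\\ (\delta_T(i)+\delta_{T'}(j)+2)^2 & \text{if } 1\le i\le k<j\le n\end{cases}$$

and therefore

$$\begin{aligned}
D^{(2)}(T\,\widehat{\ }\,T') &= \sum_{1\le i<j\le n}d_{T\,\widehat{\ }\,T'}(i,j)^2
=\sum_{1\le i<j\le k}d_T(i,j)^2+\sum_{k+1\le i<j\le n}d_{T'}(i,j)^2+\sum_{\substack{1\le i\le k\\ k+1\le j\le n}}\big(\delta_T(i)+\delta_{T'}(j)+2\big)^2\\
&= D^{(2)}(T)+D^{(2)}(T')+\sum_{\substack{1\le i\le k\\ k+1\le j\le n}}\big(\delta_T(i)^2+\delta_{T'}(j)^2+2\delta_T(i)\delta_{T'}(j)+4\delta_T(i)+4\delta_{T'}(j)+4\big)\\
&= D^{(2)}(T)+D^{(2)}(T')+(n-k)\sum_{i=1}^{k}\big(\delta_T(i)^2+4\delta_T(i)\big)+k\sum_{j=k+1}^{n}\big(\delta_{T'}(j)^2+4\delta_{T'}(j)\big)\\
&\qquad +2\Big(\sum_{i=1}^{k}\delta_T(i)\Big)\Big(\sum_{j=k+1}^{n}\delta_{T'}(j)\Big)+4k(n-k)\\
&= D^{(2)}(T)+D^{(2)}(T')+(n-k)\big(S^{(2)}(T)+4S(T)\big)+k\big(S^{(2)}(T')+4S(T')\big)+2S(T)S(T')+4k(n-k).
\end{aligned}$$

□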
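The two recursive expressions can be checked on concrete trees. The sketch below assumes a nested-tuple encoding in which joining two trees under a new common root is simply forming the pair of both; all names are hypothetical. For the example trees, $S(T)=5$, $S^{(2)}(T)=9$, $S(T')=9$ and $S^{(2)}(T')=23$, so expression (1) predicts $9+23+2\cdot 14+7=67$.

```python
def leaf_depths(tree, d=0):
    """List of leaf depths of `tree`, counted from offset `d`."""
    if not isinstance(tree, tuple):
        return [d]
    return leaf_depths(tree[0], d + 1) + leaf_depths(tree[1], d + 1)

def S(tree):
    """Sackin index: sum of the depths of the leaves."""
    return sum(leaf_depths(tree))

def S2(tree):
    """Sum of the squared depths of the leaves."""
    return sum(d * d for d in leaf_depths(tree))

def D2(tree):
    """Sum of squared path lengths over all pairs of leaves."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    cross = sum((a + c) ** 2
                for a in leaf_depths(left, 1) for c in leaf_depths(right, 1))
    return D2(left) + D2(right) + cross

T1 = ((1, 2), 3)            # k = 3 leaves
T2 = (4, (5, (6, 7)))       # n - k = 4 leaves
joined = (T1, T2)           # both roots hang from a new common root
k, n = 3, 7

rhs_S2 = S2(T1) + S2(T2) + 2 * (S(T1) + S(T2)) + n
rhs_D2 = (D2(T1) + D2(T2) + (n - k) * (S2(T1) + 4 * S(T1))
          + k * (S2(T2) + 4 * S(T2)) + 2 * S(T1) * S(T2) + 4 * k * (n - k))
print(S2(joined), rhs_S2)
print(D2(joined), rhs_D2)
```

Since `D2` itself is computed by splitting at the root, the second comparison mainly tests the algebraic expansion of the cross term.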
Now we can compute explicit formulas for $E_Y(S_n^{(2)})$ and $E_Y(D_n^{(2)})$.

Proposition. $E_Y(S_n^{(2)})=4n\big(H_n^2-H_n^{(2)}\big)-6n(H_n-1)$.



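Proof. We compute $E_Y(S_n^{(2)})$ using its very definition:

$$\begin{aligned}
E_Y(S_n^{(2)}) &= \sum_{T\in\mathcal{T}_n}S^{(2)}(T)\,P_Y(T)
=\frac{1}{2}\sum_{k=1}^{n-1}\sum_{\substack{S_k\subsetneq\{1,\ldots,n\}\\ |S_k|=k}}\ \sum_{T_k\in\mathcal{T}(S_k)}\ \sum_{T'_{n-k}\in\mathcal{T}(S_k^c)}S^{(2)}(T_k\,\widehat{\ }\,T'_{n-k})\,P_Y(T_k\,\widehat{\ }\,T'_{n-k})\\
&= \frac{1}{2}\sum_{k=1}^{n-1}\binom{n}{k}\sum_{T_k\in\mathcal{T}_k}\ \sum_{T'_{n-k}\in\mathcal{T}_{n-k}}\Big(S^{(2)}(T_k)+S^{(2)}(T'_{n-k})+2\big(S(T_k)+S(T'_{n-k})\big)+n\Big)\frac{2}{(n-1)\binom{n}{k}}P_Y(T_k)P_Y(T'_{n-k})\\
&= \frac{1}{n-1}\sum_{k=1}^{n-1}\Big(\sum_{T_k}\sum_{T'_{n-k}}S^{(2)}(T_k)P_Y(T_k)P_Y(T'_{n-k})+\sum_{T_k}\sum_{T'_{n-k}}S^{(2)}(T'_{n-k})P_Y(T_k)P_Y(T'_{n-k})\\
&\qquad +2\sum_{T_k}\sum_{T'_{n-k}}\big(S(T_k)+S(T'_{n-k})\big)P_Y(T_k)P_Y(T'_{n-k})+n\sum_{T_k}\sum_{T'_{n-k}}P_Y(T_k)P_Y(T'_{n-k})\Big)\\
&= \frac{1}{n-1}\sum_{k=1}^{n-1}\Big(\sum_{T_k}S^{(2)}(T_k)P_Y(T_k)+\sum_{T'_{n-k}}S^{(2)}(T'_{n-k})P_Y(T'_{n-k})
+2\sum_{T_k}S(T_k)P_Y(T_k)+2\sum_{T'_{n-k}}S(T'_{n-k})P_Y(T'_{n-k})+n\Big)\\
&= \frac{1}{n-1}\sum_{k=1}^{n-1}\Big(E_Y(S_k^{(2)})+E_Y(S_{n-k}^{(2)})+2E_Y(S_k)+2E_Y(S_{n-k})+n\Big)\\
&= \frac{2}{n-1}\sum_{k=1}^{n-1}E_Y(S_k^{(2)})+\frac{4}{n-1}\sum_{k=1}^{n-1}E_Y(S_k)+n.
\end{aligned}$$

In particular

$$E_Y(S_{n-1}^{(2)})=\frac{2}{n-2}\sum_{k=1}^{n-2}E_Y(S_k^{(2)})+\frac{4}{n-2}\sum_{k=1}^{n-2}E_Y(S_k)+n-1$$

and therefore

$$\begin{aligned}
E_Y(S_n^{(2)}) &= \frac{n-2}{n-1}\cdot\frac{2}{n-2}\sum_{k=1}^{n-2}E_Y(S_k^{(2)})+\frac{2}{n-1}E_Y(S_{n-1}^{(2)})\\
&\qquad +\frac{n-2}{n-1}\cdot\frac{4}{n-2}\sum_{k=1}^{n-2}E_Y(S_k)+\frac{4}{n-1}E_Y(S_{n-1})+\frac{n-2}{n-1}\cdot(n-1)+2\\
&= \frac{n-2}{n-1}E_Y(S_{n-1}^{(2)})+\frac{2}{n-1}E_Y(S_{n-1}^{(2)})+\frac{4}{n-1}E_Y(S_{n-1})+2=\frac{n}{n-1}E_Y(S_{n-1}^{(2)})+8H_{n-1}-6.
\end{aligned}$$

Setting $x_n=E_Y(S_n^{(2)})/n$, this recurrence becomes

$$x_n=x_{n-1}+\frac{8H_{n-1}}{n}-\frac{6}{n}.$$

Since $S^{(2)}$ applied to a single node is 0, $x_1=E_Y(S_1^{(2)})=0$, and the solution of this recursive equation with
this initial condition is

$$x_n=\sum_{k=2}^{n}\Big(\frac{8H_{k-1}}{k}-\frac{6}{k}\Big)=8\sum_{k=1}^{n-1}\frac{H_k}{k+1}-6\sum_{k=2}^{n}\frac{1}{k}=4\big(H_n^2-H_n^{(2)}\big)-6(H_n-1),$$

from where we deduce that

$$E_Y(S_n^{(2)})=nx_n=4n\big(H_n^2-H_n^{(2)}\big)-6n(H_n-1)$$

as we claimed. □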
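The recurrence obtained in this proof can be iterated numerically and compared with the closed formula. The sketch below (hypothetical function names) uses the convolution form $E_Y(S_n^{(2)})=\frac{1}{n-1}\sum_{k=1}^{n-1}\big(E_Y(S_k^{(2)})+E_Y(S_{n-k}^{(2)})+2E_Y(S_k)+2E_Y(S_{n-k})+n\big)$ together with the known value $E_Y(S_n)=2n(H_n-1)$, in exact arithmetic.

```python
from fractions import Fraction

def harmonic(n, power=1):
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def expected_S(n):
    """E_Y(S_n) = 2n(H_n - 1)."""
    return 2 * n * (harmonic(n) - 1)

def expected_S2_closed(n):
    """Closed formula of the Proposition."""
    h, h2 = harmonic(n), harmonic(n, 2)
    return 4 * n * (h * h - h2) - 6 * n * (h - 1)

def expected_S2_by_recurrence(n_max):
    """Iterate the convolution recurrence starting from E_Y(S_1^{(2)}) = 0."""
    es = {k: expected_S(k) for k in range(1, n_max + 1)}
    es2 = {1: Fraction(0)}
    for n in range(2, n_max + 1):
        total = sum(es2[k] + es2[n - k] + 2 * es[k] + 2 * es[n - k] + n
                    for k in range(1, n))
        es2[n] = total / (n - 1)
    return es2

values = expected_S2_by_recurrence(30)
print(all(values[n] == expected_S2_closed(n) for n in range(1, 31)))
```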

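Theorem 3. $E_Y(D_n^{(2)})=8n(n+1)\big(H_n^2-H_n^{(2)}\big)-2n(15n+7)H_n+45n^2-n$.

Proof. Again, we compute $E_Y(D_n^{(2)})$ using its very definition:

$$\begin{aligned}
E_Y(D_n^{(2)}) &= \sum_{T\in\mathcal{T}_n}D^{(2)}(T)\,P_Y(T)
=\frac{1}{2}\sum_{k=1}^{n-1}\sum_{\substack{S_k\subsetneq\{1,\ldots,n\}\\ |S_k|=k}}\ \sum_{T_k\in\mathcal{T}(S_k)}\ \sum_{T'_{n-k}\in\mathcal{T}(S_k^c)}D^{(2)}(T_k\,\widehat{\ }\,T'_{n-k})\,P_Y(T_k\,\widehat{\ }\,T'_{n-k})\\
&= \frac{1}{2}\sum_{k=1}^{n-1}\binom{n}{k}\sum_{T_k\in\mathcal{T}_k}\ \sum_{T'_{n-k}\in\mathcal{T}_{n-k}}\Big(D^{(2)}(T_k)+D^{(2)}(T'_{n-k})+(n-k)\big(S^{(2)}(T_k)+4S(T_k)\big)\\
&\qquad +k\big(S^{(2)}(T'_{n-k})+4S(T'_{n-k})\big)+2S(T_k)S(T'_{n-k})+4k(n-k)\Big)\frac{2}{(n-1)\binom{n}{k}}P_Y(T_k)P_Y(T'_{n-k})\\
&= \frac{1}{n-1}\sum_{k=1}^{n-1}\Big(\sum_{T_k}D^{(2)}(T_k)P_Y(T_k)+\sum_{T'_{n-k}}D^{(2)}(T'_{n-k})P_Y(T'_{n-k})
+2\Big(\sum_{T_k}S(T_k)P_Y(T_k)\Big)\Big(\sum_{T'_{n-k}}S(T'_{n-k})P_Y(T'_{n-k})\Big)\\
&\qquad +(n-k)\sum_{T_k}S^{(2)}(T_k)P_Y(T_k)+4(n-k)\sum_{T_k}S(T_k)P_Y(T_k)
+k\sum_{T'_{n-k}}S^{(2)}(T'_{n-k})P_Y(T'_{n-k})\\
&\qquad +4k\sum_{T'_{n-k}}S(T'_{n-k})P_Y(T'_{n-k})+4k(n-k)\Big)\\
&= \frac{1}{n-1}\sum_{k=1}^{n-1}\Big(E_Y(D_k^{(2)})+E_Y(D_{n-k}^{(2)})+2E_Y(S_k)E_Y(S_{n-k})+(n-k)E_Y(S_k^{(2)})\\
&\qquad +4(n-k)E_Y(S_k)+kE_Y(S_{n-k}^{(2)})+4kE_Y(S_{n-k})+4k(n-k)\Big)\\
&= \frac{2}{n-1}\sum_{k=1}^{n-1}\Big(E_Y(D_k^{(2)})+E_Y(S_k)E_Y(S_{n-k})+(n-k)E_Y(S_k^{(2)})+4(n-k)E_Y(S_k)\Big)+\frac{2}{3}n(n+1).
\end{aligned}$$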



In particular

$$E_Y(D_{n-1}^{(2)})=\frac{2}{n-2}\sum_{k=1}^{n-2}\Big(E_Y(D_k^{(2)})+E_Y(S_k)E_Y(S_{n-1-k})+(n-1-k)E_Y(S_k^{(2)})
+4(n-1-k)E_Y(S_k)\Big)+\frac{2}{3}n(n-1)$$



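and therefore

$$\begin{aligned}
E_Y(D_n^{(2)}) &= \frac{n-2}{n-1}\cdot\frac{2}{n-2}\sum_{k=1}^{n-2}E_Y(D_k^{(2)})+\frac{2}{n-1}E_Y(D_{n-1}^{(2)})\\
&\quad +\frac{n-2}{n-1}\cdot\frac{2}{n-2}\sum_{k=1}^{n-2}E_Y(S_k)E_Y(S_{n-1-k})+\frac{2}{n-1}\Big(\sum_{k=1}^{n-1}E_Y(S_k)E_Y(S_{n-k})-\sum_{k=1}^{n-2}E_Y(S_k)E_Y(S_{n-1-k})\Big)\\
&\quad +\frac{n-2}{n-1}\cdot\frac{2}{n-2}\sum_{k=1}^{n-2}(n-1-k)E_Y(S_k^{(2)})+\frac{2}{n-1}\Big(\sum_{k=1}^{n-1}(n-k)E_Y(S_k^{(2)})-\sum_{k=1}^{n-2}(n-1-k)E_Y(S_k^{(2)})\Big)\\
&\quad +\frac{n-2}{n-1}\cdot\frac{8}{n-2}\sum_{k=1}^{n-2}(n-1-k)E_Y(S_k)+\frac{8}{n-1}\Big(\sum_{k=1}^{n-1}(n-k)E_Y(S_k)-\sum_{k=1}^{n-2}(n-1-k)E_Y(S_k)\Big)\\
&\quad +\frac{n-2}{n-1}\cdot\frac{2}{3}n(n-1)+\frac{2}{3}n(n+1)-\frac{n-2}{n-1}\cdot\frac{2}{3}n(n-1)\\
&= \frac{n-2}{n-1}E_Y(D_{n-1}^{(2)})+\frac{2}{n-1}E_Y(D_{n-1}^{(2)})+\frac{2}{n-1}\sum_{k=1}^{n-2}E_Y(S_k)\big(E_Y(S_{n-k})-E_Y(S_{n-k-1})\big)\\
&\quad +\frac{2}{n-1}\sum_{k=1}^{n-1}E_Y(S_k^{(2)})+\frac{8}{n-1}\sum_{k=1}^{n-1}E_Y(S_k)+2n\\
&= \frac{n}{n-1}E_Y(D_{n-1}^{(2)})+\frac{8}{n-1}\sum_{k=1}^{n-2}k(H_k-1)H_{n-k-1}+\frac{2}{n-1}\sum_{k=1}^{n-1}\Big(4k\big(H_k^2-H_k^{(2)}\big)-6k(H_k-1)\Big)\\
&\quad +\frac{16}{n-1}\sum_{k=1}^{n-1}k(H_k-1)+2n\\
&= \frac{n}{n-1}E_Y(D_{n-1}^{(2)})+\frac{8}{n-1}\sum_{k=1}^{n-2}kH_kH_{n-k-1}-\frac{8}{n-1}\sum_{k=1}^{n-2}kH_{n-k-1}+\frac{8}{n-1}\sum_{k=1}^{n-1}k\big(H_k^2-H_k^{(2)}\big)\\
&\quad +\frac{4}{n-1}\sum_{k=1}^{n-1}kH_k-\frac{4}{n-1}\sum_{k=1}^{n-1}k+2n\\
&= \frac{n}{n-1}E_Y(D_{n-1}^{(2)})+\frac{8}{n-1}\sum_{k=1}^{n-2}kH_kH_{n-k-1}-8\sum_{k=1}^{n-2}H_k+\frac{8}{n-1}\sum_{k=1}^{n-1}k\big(H_k^2-H_k^{(2)}\big)\\
&\quad +\frac{12}{n-1}\sum_{k=1}^{n-1}kH_k-8H_{n-1}\\
&= \frac{n}{n-1}E_Y(D_{n-1}^{(2)})+4n\big(H_n^2-H_n^{(2)}-2H_n+2\big)-8(n-1)(H_{n-1}-1)+4n\big(H_n^2-H_n^{(2)}\big)\\
&\quad -2n(2H_n-1)+3n(2H_n-1)-8H_{n-1}\\
&= \frac{n}{n-1}E_Y(D_{n-1}^{(2)})+8n\big(H_n^2-H_n^{(2)}\big)-14nH_{n-1}+15n-14.
\end{aligned}$$

Here we have used that $E_Y(S_k)=2k(H_k-1)$, that $E_Y(S_{n-k})-E_Y(S_{n-k-1})=2H_{n-k-1}$, and the value of $E_Y(S_k^{(2)})$ given in the previous proposition.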

Setting $x_n=E_Y(D_n^{(2)})/n$, this recurrence becomes

$$x_n=x_{n-1}+8\big(H_n^2-H_n^{(2)}\big)-14H_{n-1}+15-\frac{14}{n}.$$

The solution of this recursive equation with $x_1=E_Y(D_1^{(2)})=0$ is

$$\begin{aligned}
x_n &= \sum_{k=2}^{n}\Big(8\big(H_k^2-H_k^{(2)}\big)-14H_{k-1}+15-\frac{14}{k}\Big)\\
&= 8\sum_{k=1}^{n}\big(H_k^2-H_k^{(2)}\big)-14\sum_{k=1}^{n-1}H_k+15(n-1)-14\sum_{k=2}^{n}\frac{1}{k}\\
&= 8(n+1)\big(H_{n+1}^2-H_{n+1}^{(2)}\big)-16(n+1)(H_{n+1}-1)-14n(H_n-1)+15(n-1)-14(H_n-1)\\
&= 8(n+1)\big(H_n^2-H_n^{(2)}\big)-2(15n+7)H_n+45n-1
\end{aligned}$$

from where we deduce that

$$E_Y(D_n^{(2)})=nx_n=8n(n+1)\big(H_n^2-H_n^{(2)}\big)-2n(15n+7)H_n+45n^2-n$$

as we claimed. □
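As with the previous proposition, the recurrence behind this proof can be iterated and compared with the closed formula of Theorem 3. The sketch below (hypothetical function names, exact arithmetic) uses the form $E_Y(D_n^{(2)})=\frac{2}{n-1}\sum_{k=1}^{n-1}\big(E_Y(D_k^{(2)})+E_Y(S_k)E_Y(S_{n-k})+(n-k)E_Y(S_k^{(2)})+4(n-k)E_Y(S_k)\big)+\frac{2}{3}n(n+1)$, with the closed formulas for $E_Y(S_k)$ and $E_Y(S_k^{(2)})$ plugged in.

```python
from fractions import Fraction

def harmonic(n, power=1):
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))

def expected_S(n):
    """E_Y(S_n) = 2n(H_n - 1)."""
    return 2 * n * (harmonic(n) - 1)

def expected_S2(n):
    """E_Y(S_n^{(2)}) = 4n(H_n^2 - H_n^{(2)}) - 6n(H_n - 1)."""
    h, h2 = harmonic(n), harmonic(n, 2)
    return 4 * n * (h * h - h2) - 6 * n * (h - 1)

def expected_D2_closed(n):
    """Closed formula of Theorem 3."""
    h, h2 = harmonic(n), harmonic(n, 2)
    return (8 * n * (n + 1) * (h * h - h2)
            - 2 * n * (15 * n + 7) * h + 45 * n * n - n)

def expected_D2_by_recurrence(n_max):
    """Iterate the convolution recurrence starting from E_Y(D_1^{(2)}) = 0."""
    es = {k: expected_S(k) for k in range(1, n_max + 1)}
    es2 = {k: expected_S2(k) for k in range(1, n_max + 1)}
    ed2 = {1: Fraction(0)}
    for n in range(2, n_max + 1):
        total = sum(ed2[k] + es[k] * es[n - k]
                    + (n - k) * es2[k] + 4 * (n - k) * es[k]
                    for k in range(1, n))
        ed2[n] = Fraction(2, n - 1) * total + Fraction(2 * n * (n + 1), 3)
    return ed2

values = expected_D2_by_recurrence(25)
print(all(values[n] == expected_D2_closed(n) for n in range(1, 26)))
```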






